Chris Bail
Duke University
website: https://www.chrisbail.net
github: https://github.com/cbail
Twitter: https://www.twitter.com/chris_bail
install.packages("rvest")
library(rvest)
We are going to begin by scraping this very simple web page from Wikipedia.
wikipedia_page<-read_html("https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000")
wikipedia_page
{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...
section_of_wikipedia<-html_node(wikipedia_page, xpath='//*[@id="mw-content-text"]/div/table')
head(section_of_wikipedia)
$node
<pointer: 0x7f7f99da9740>
$doc
<pointer: 0x7f7f99d9a390>
health_rankings<-html_table(section_of_wikipedia)
head(health_rankings[,(1:2)])
Country Attainment of goals / Health / Level (DALE)
1 Afghanistan 168
2 Albania 102
3 Algeria 84
4 Andorra 10
5 Angola 165
6 Antigua and Barbuda 48
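Once the table is in a data frame, a little cleaning makes it easier to work with. The sketch below renames the first two columns and sorts by rank; the column positions are assumptions based on the output above, so verify them against your own scrape before relying on them.

```r
library(rvest)

# Re-scrape the table as above (assumes the same XPath still works)
wikipedia_page <- read_html("https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000")
section_of_wikipedia <- html_node(wikipedia_page,
                                  xpath = '//*[@id="mw-content-text"]/div/table')
health_rankings <- html_table(section_of_wikipedia)

# Give the first two columns shorter names and make the rank numeric
# (column positions are an assumption -- check names(health_rankings))
colnames(health_rankings)[1:2] <- c("country", "dale_rank")
health_rankings$dale_rank <- as.numeric(health_rankings$dale_rank)

# Sort so the best-ranked health systems come first
head(health_rankings[order(health_rankings$dale_rank), 1:2])
```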
1) Pick another Wikipedia page.
2) Try to scrape some information of interest.
Hint: the html_text function in rvest may be useful if you wish to scrape text.
duke_page<-read_html("https://www.duke.edu")
duke_events<-html_nodes(duke_page, css="li:nth-child(1) .epsilon")
html_text(duke_events)
[1] "In Appalachia, the Fight Against Opioid Abuse Comes to the Pulpit\n\n\t\t\t\t\t\t\t"
[2] "Get Started with Get Moving\n\n\t\t\t\t\t\t\t"
[3] "Duke Hospital Exhibit: Meet the Artist & Demonstration with painter Sally Sutton"
[4] "Nicholas School Students Speak Out on Inclusion"
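Notice that some of the strings above still contain trailing newlines and tabs. A minimal cleanup sketch using only base R:

```r
# Collapse runs of whitespace to a single space, then trim the ends
clean_events <- gsub("\\s+", " ", html_text(duke_events))
clean_events <- trimws(clean_events)
clean_events
```

In newer versions of rvest (1.0.0 and later), html_text2() performs similar trimming automatically.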
devtools::install_github("ropensci/RSelenium")
library(RSelenium)
Note: you may need to install Java to get up and running; see this tutorial.
rD <- rsDriver()
remDr <- rD$client
remDr$navigate("https://www.duke.edu")
search_box <- remDr$findElement(using = 'css selector', 'fieldset input')
search_box$sendKeysToElement(list("data science", "\uE007"))
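After sendKeysToElement() submits the search, the browser loads a results page, and getPageSource() returns its HTML as a one-element list that rvest can parse like any other page. The CSS selector below is a placeholder assumption; inspect the actual results page (e.g. with SelectorGadget) to find the right one.

```r
Sys.sleep(2)  # give the results page a moment to load

# Pull the rendered HTML out of the browser and parse it with rvest
results_page <- read_html(remDr$getPageSource()[[1]])

# "h3 a" is a hypothetical selector -- replace it with one that matches
# the result titles on the page you are scraping
result_titles <- html_nodes(results_page, css = "h3 a")
html_text(result_titles)
```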
#create list of websites to scrape
my_list_of_websites<-c("www.duke.edu","www.penn.edu")
#create place to store text data
text_data<-as.data.frame(NULL)
#loop
for(i in 1:length(my_list_of_websites)){
#read in page and extract text
page<-read_html("https://www.duke.edu")
events<-html_nodes(page, css="li:nth-child(1) .epsilon")
text<-html_text(events)
#store text in dataset created above
text_data<-rbind(text_data,text)
#print iteration for de-bugging
print(i)
}
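When looping over many pages, a single failed request will halt the entire loop. A defensive sketch using tryCatch() to skip failures, plus a polite pause between requests (the URL here is just an example):

```r
library(rvest)

# Wrap read_html() so an unreachable site yields NULL instead of an error
page <- tryCatch(read_html("https://www.duke.edu"),
                 error = function(e) NULL)
if(is.null(page)){
  print("request failed; skipping this site")
} else {
  print("page loaded")
}

Sys.sleep(1)  # pause between requests so you do not hammer the server
```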
1) Pick three web pages.
2) Write a loop that extracts some information from each of them using one of the techniques above.
Hint: the loop we just created does not work; can you figure out why?
Hint: websites that share the same structure will be much easier to scrape.
Hint: most people will not be able to complete this exercise.